Abstract —Ray tracing is a technique for generating an image
by tracing the path of light through pixels in an image plane
and simulating the effects of high-quality global illumination at
a heavy computational cost. Because of this high computational
complexity, it cannot meet the requirements of real-time rendering.
The emergence of many-core architectures makes it possible to
significantly reduce the running time of ray tracing algorithms by
exploiting their powerful floating-point capability. In this paper,
a new GPU implementation and optimization of ray tracing that
accelerates the rendering process is presented.

Index Terms —Radiosity, GPU, OpenCL, Ray Tracing

I. Introduction

Photorealistic rendering aims to reproduce the reflection and
shadow effects of real light. Unlike the real-time rendering
pipeline, it must reach a level of realism at which a rendered
image is hard to distinguish from a real one, so illumination and
materials require complicated and accurate simulation.
Physically based rendering can achieve photorealism, but its huge
computational cost prevents images from being generated in real
time. The two approaches make opposite trade-offs: pipeline
rendering sacrifices realism for high-speed, real-time
performance, whereas photorealistic rendering gives up speed and
real-time performance to greatly enhance realism. Because of
these properties, ray tracing has been widely applied in film,
advertising, animation, and other visual industries. Ray tracing
differs from rasterization, the technique widely used in
interactive computer graphics. Based on physical optics, ray
tracing simulates the propagation of light in the real world and
calculates the distribution of radiation. Because simulating
light propagation is computationally expensive, rendering an
image usually takes tens of minutes to several hours, so
producing high-quality realistic images generally requires
specialized high-performance equipment. Before GPU computing was
proposed, ray tracing had always been very time-consuming.

In recent years, with the emergence of parallel computing based
on GPU architectures, many researchers have become interested in
exploiting the GPU's powerful floating-point capability to
improve the efficiency of ray tracing algorithms, partly because
of the low entry threshold. Unlike a CPU, a GPU generally
comprises hundreds or thousands of stream processors. A many-core
architecture is split into a large number of much smaller cores,
and each core is an in-order, heavily multi-threaded,
single-instruction-issue processor that shares its control logic
and instruction cache with other cores. Data-intensive
applications can therefore easily harness the potential power of
GPUs. A ray tracing algorithm involves a large number of
calculations, such as traversal, iteration, and intersection, all
of which can be decomposed into independent subtasks that execute
in parallel. It is therefore easy to see how much the performance
of ray tracing can change on a GPU architecture.

In modern software, program sections often exhibit a rich amount
of data parallelism, a property that allows many arithmetic
operations to be performed on program data structures
simultaneously. CUDA devices accelerate the execution of these
applications by exploiting a large amount of data parallelism.
Besides CUDA, several other tools are in use, including
languages, libraries, and compiler directives. One example is
OpenCL, a framework for writing programs that execute across
heterogeneous platforms consisting of CPUs, GPUs, digital signal
processors (DSPs), and other processors. Considering the
strengths of OpenCL, such as flexibility, portability, and
versatility, we used OpenCL to optimize and accelerate the ray
tracing algorithm.

II. The Problem

Since the vast majority of ray tracing applications today run on
CPU architectures, the efficiency of ray tracing is directly
related to cycles per instruction (CPI) and clock rate. CPI
depends on the instruction set architecture (ISA). As Moore's law
reaches a bottleneck, CPU manufacturers have gradually approached
the limit of clock frequency. Thus, a serial program cannot
fundamentally improve the efficiency of ray tracing. Even on
multi-core CPU architectures, performance has not yet taken a
great leap forward.

To address these problems, many researchers have designed
acceleration techniques for ray tracing, including space
partitioning, bounding boxes, and spatial sorting. Because these
methods exclude the objects and lights that are not involved in
tracing a given ray, the optimized scene greatly reduces the time
overhead of ray tracing. Nevertheless, every optimization method
has limitations to some degree. For example, the efficiency of
space partitioning is generally limited in dense scenes.

On the other hand, there is hardware designed specifically for
ray tracing, for example the light processing unit developed by
Stanford. Because such dedicated hardware lacks generality, only
a few people can use it. Another solution is distributed
computing on a cluster, which splits the problem into independent
subproblems and maps these tasks onto different compute nodes.
The cost of this approach is significant, and it is extremely
hard to guarantee load balancing.

It is becoming increasingly common to use a general-purpose
graphics processing unit as a modified form of stream processor.
This concept turns the massive computational power of a modern
graphics accelerator's shader pipeline into general-purpose
computing power. GPUs can be used for many types of
embarrassingly parallel tasks, including ray tracing. They are
generally suited to high-throughput computations that exhibit
data parallelism and can exploit the wide-vector-width SIMD
architecture of the GPU.

In general, a GPU can launch tens of thousands of lightweight
threads that execute the same kernel function simultaneously.
With this feature, independent lightweight threads can take the
place of the multi-level iterations, making the ray tracing
algorithm massively parallel. A GPU can therefore greatly improve
the efficiency of ray tracing.
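
As a minimal sketch of this idea (written in Python rather than
OpenCL, with the hypothetical helper `pixel_of`), a nested loop
over pixels can be replaced by a flat range of independent work
items. Each work item recovers its pixel coordinates from its
index, much as an OpenCL work item would from its global ID. The
640 x 480 size is used only for illustration.

```python
WIDTH, HEIGHT = 640, 480  # illustrative image size

def pixel_of(global_id, width=WIDTH):
    """Map a flat work-item index to (x, y) pixel coordinates."""
    return global_id % width, global_id // width

# Serial form: two nested loops visit every pixel in row-major order.
nested = [(x, y) for y in range(HEIGHT) for x in range(WIDTH)]

# Data-parallel form: one independent work item per pixel.
# Here the items are evaluated in a loop; on a GPU they would run concurrently.
flat = [pixel_of(gid) for gid in range(WIDTH * HEIGHT)]
```

Because no work item depends on another, the order in which the
flat indices are processed does not matter.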

III. Ray Tracing

In computer graphics, ray tracing is a technique for generating
an image by tracing the path of light through pixels in an image
plane and simulating the effects of its encounters with virtual
objects. If a ray intersects an object, then according to
radiosity theory the color value of the corresponding point in
the image plane can be calculated from parameters such as the
material, the normal vector at the intersection point, and the
light distribution. More specifically, the critical step in
obtaining the color value at a point is to calculate the radiance
leaving that point in the direction opposite to the cast ray.
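
To make "tracing the path of light through pixels" concrete, the
sketch below computes the direction of the primary ray through a
pixel center. It assumes a pinhole camera at the origin looking
down the negative z axis with a 60-degree vertical field of view;
the paper does not describe its camera model, so the function
name and every parameter here are illustrative.

```python
import math

def primary_ray_direction(x, y, width, height, fov_deg=60.0):
    """Unit direction of the ray through the center of pixel (x, y).

    Assumes a pinhole camera at the origin looking along -z.
    """
    aspect = width / height
    half = math.tan(math.radians(fov_deg) / 2.0)
    # Map the pixel center to [-1, 1] screen coordinates, y pointing up.
    px = (2.0 * (x + 0.5) / width - 1.0) * half * aspect
    py = (1.0 - 2.0 * (y + 0.5) / height) * half
    length = math.sqrt(px * px + py * py + 1.0)
    return (px / length, py / length, -1.0 / length)

top_left = primary_ray_direction(0, 0, 2, 2)
bottom_right = primary_ray_direction(1, 1, 2, 2)
```

Opposite corners of the image produce mirrored directions, which
is a quick sanity check on the mapping.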


With $L$ denoting the radiance reflected along $\omega_o$,
Lambert's emission law gives the following equation:

$$L = \frac{d^2\Phi}{dA \cos\theta \, d\omega} \qquad (1)$$

In Eq. (1), $d^2\Phi$ is the radiant power emitted from the
surface element $dA$ into the solid angle $d\omega$.
Irradiance is given by the formula:

$$E = \frac{d\Phi}{dA} \qquad (2)$$


Taking the incident direction into account, Eq. (1) is
substituted into Eq. (2) as follows:

$$dE_i(p, \omega_i) = L_i(p, \omega_i) \cos\theta \, d\omega_i \qquad (3)$$
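
The substitution step can be spelled out as follows. This is a
sketch that assumes $\Phi$ is radiant power, $E$ is irradiance,
and $\theta$ is the angle between the incident direction
$\omega_i$ and the surface normal, which is how the surrounding
text appears to use them. Solving the radiance definition for the
power and dividing by the area element gives the differential
irradiance:

```latex
\begin{aligned}
d^2\Phi &= L_i(p,\omega_i)\,\cos\theta\; dA\; d\omega_i \\
dE_i(p,\omega_i) &= \frac{d^2\Phi}{dA}
                  = L_i(p,\omega_i)\,\cos\theta\; d\omega_i
\end{aligned}
```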

In Eq. (3), the irradiance $dE_i(p, \omega_i)$ received at point
$p$ is calculated from the radiance $L_i(p, \omega_i)$ arriving
at that point. The incident angle $\theta$ is the other factor
that affects the result. For general materials, reflected
radiance is proportional to irradiance; that is, the more
radiosity a point receives, the more it reflects. Thus, the
following relation holds:

$$dL_o(p, \omega_o) \propto dE_i(p, \omega_i) \qquad (4)$$

If the bidirectional reflectance distribution function (BRDF)
$f_r$ is used as the scale factor, Eq. (4) can be written as
Eq. (5):

$$dL_o(p, \omega_o) = f_r(p, \omega_i, \omega_o) \, dE_i(p, \omega_i) \qquad (5)$$

Eq. (3) is then substituted into Eq. (5) as follows:

$$dL_o(p, \omega_o) = f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta \, d\omega_i \qquad (6)$$
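
As a worked instance of Eq. (6), the sketch below evaluates the
reflected differential radiance for one incident direction. It
assumes an ideal diffuse (Lambertian) BRDF, $f_r = \rho / \pi$,
which the text does not specify; the function names and numeric
values are illustrative only.

```python
import math

def lambertian_brdf(albedo):
    """Constant BRDF of an ideal diffuse surface (an assumption, not from the paper)."""
    return albedo / math.pi

def reflected_radiance(brdf, incident_radiance, theta, d_omega):
    """Eq. (6): dL_o = f_r * L_i * cos(theta) * d_omega_i."""
    return brdf * incident_radiance * math.cos(theta) * d_omega

f_r = lambertian_brdf(0.8)
# Same incident radiance and solid angle, two different incident angles.
head_on = reflected_radiance(f_r, 10.0, 0.0, 0.01)
grazing = reflected_radiance(f_r, 10.0, math.radians(80.0), 0.01)
```

The grazing contribution is smaller than the head-on one purely
because of the cosine factor.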



Fig. 1. In the radiosity model, $\omega_i$ and $\omega_o$ represent the
directions of incident light and emergent light.

As shown in Fig. 1, point $p$ is an arbitrary point on the
object surface. It is the center of a hemisphere, and that
hemisphere is the integration region for point $p$. By
convention, $\omega_i$ points toward the light source or a
sampling point on its surface, and $\omega_o$ points toward the
viewing plane. The radiance along the reflected direction
$\omega_o$ is denoted $L$.

If the surface of the object is a self-illuminated material, the
radiance leaving the surface includes the radiosity it emits
itself in addition to the radiosity it reflects. Let $L_e$ denote
the radiance emitted by the self-illuminated material; adding it
to Eq. (6) gives:

$$dL_o(p, \omega_o) = L_e + f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta \, d\omega_i \qquad (7)$$

As shown in Fig. 1, if only a single incident direction
$\omega_i$ is considered, Eq. (7) gives the radiance reflected in
any direction. However, the irradiance at point $p$ cannot
originate from a single direction only. In reality, point $p$
receives irradiance from all directions in the hemisphere above
it. The radiance is therefore obtained by integrating Eq. (7) as
follows:

$$L_o(p, \omega_o) = L_e + \int_{2\pi^+} f_r(p, \omega_i, \omega_o) \, L_i(p, \omega_i) \cos\theta \, d\omega_i \qquad (8)$$
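
One common way to approximate a hemisphere integral of this form
is Monte Carlo sampling. The sketch below is not the paper's
method: it assumes a Lambertian BRDF $\rho / \pi$ and a constant
incident radiance from every direction, a special case chosen
because the integral then has the closed form $L_e + \rho L_i$
to compare against. All names are illustrative.

```python
import math
import random

def estimate_outgoing_radiance(emitted, albedo, incident_radiance,
                               n_samples, seed=0):
    """Monte Carlo estimate of Eq. (8) for a Lambertian surface
    lit uniformly from the whole hemisphere (assumed setup)."""
    rng = random.Random(seed)
    brdf = albedo / math.pi        # Lambertian BRDF (assumption)
    pdf = 1.0 / (2.0 * math.pi)    # density of uniform hemisphere directions
    total = 0.0
    for _ in range(n_samples):
        # For directions uniform in solid angle, cos(theta) is uniform on [0, 1).
        cos_theta = rng.random()
        total += brdf * incident_radiance * cos_theta / pdf
    return emitted + total / n_samples

estimate = estimate_outgoing_radiance(emitted=0.5, albedo=0.8,
                                      incident_radiance=2.0,
                                      n_samples=100_000)
exact = 0.5 + 0.8 * 2.0  # closed form L_e + rho * L_i for this special case
```

The estimate approaches the closed-form value as the number of
samples grows, which mirrors the way a rendered image improves as
more directions are sampled.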

Although Eq. (8) gives the total radiance at the surface of an
object, it cannot be solved directly, for two reasons. First,
Eq. (8) is an integral equation with constant limits of
integration and can be regarded as a Fredholm integral equation
of the second kind. Second, a computer cannot precisely simulate
the irradiance arriving from all directions in the hemisphere.
Even in a global illumination model, it is impossible to trace
all the rays arriving at one point of an object's surface. Thus,
the mathematical model described in Eq. (8) should be simplified.
We can recursively trace a small number of indirectly reflected
rays on the object's surface, where the recursion depth depends
on the number of reflections. The majority of the integral
calculation is then concentrated on the radiosity of the sampling
points on the surface of the light source, as shown in Fig. 2.



Fig. 2. In the local illumination model, all light sources contribute
radiosity to point $p$.


In Fig. 2, to obtain the radiance traveling from point $p$
toward the viewing plane, the irradiance received at that point
must be calculated using Eq. (3). Point $p$ receives all of the
radiosity from light source no. 1 and part of the radiosity from
light source no. 2. The integration must traverse all sampling
points on the surfaces of both light sources and determine, one
by one, whether the light is obstructed by an object. For
example, the object in Fig. 2 blocks some of the radiosity from
light source no. 2, and the blocked radiation makes no
contribution to the lighting of point $p$. The radiosity received
at point $p$ is then integrated. This process is generally the
most time-consuming part of ray tracing; its cost depends on the
number of light sources and geometries, the complexity of
intersecting the geometries, the number of sampling points on the
surfaces of the light sources, and so on. If anti-aliasing is
used during rendering, more rays are cast for each pixel, and the
pixel finally takes the average of their colors. The pseudocode
of local illumination ray tracing is given in Alg. 1.


As shown in Alg. 1, the multilevel nested iterations exhibit a
rich amount of data parallelism. The pseudocode considers only
the radiosity that point $p$ receives directly. Under global
illumination, point $p$ receives not only radiosity from the
light sources but also radiosity reflected from other objects, so
the program needs to be modified into a recursive version.
However, serial execution of this procedure is inefficient.


Algorithm 1 Local Illumination Ray Tracing
for each ray of each pixel in the scene do
    for each object in the scene do
        if the ray intersects the object at a point p then
            for each sampling point of each light source do
                emit a shadow ray r from point p to that sampling point
                blocked <- false
                for each object in the scene do
                    if r intersects the object then
                        blocked <- true
                        break
                    end if
                end for
                if not blocked then
                    calculate the irradiance using Eq. (3)
                    accumulate the received irradiance
                end if
            end for
        end if
    end for
    accumulate the color value of the ray
end for
take the average of the ray colors of each pixel
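
The inner part of Alg. 1, the shadow-ray loop at a single surface
point, can be sketched in Python as follows. The data layout of
`light_samples` and the `is_blocked` callback are hypothetical
stand-ins chosen for this example; in a real renderer the
callback would test the shadow ray against every object in the
scene.

```python
import math

def direct_irradiance(p, normal, light_samples, is_blocked):
    """Accumulate the irradiance reaching point p from unoccluded light samples.

    light_samples: iterable of (position, radiance, solid_angle) tuples
                   (a hypothetical layout chosen for this sketch).
    is_blocked:    callable (p, position) -> bool standing in for the
                   shadow-ray test against the scene.
    """
    total = 0.0
    for position, radiance, solid_angle in light_samples:
        to_light = [l - q for l, q in zip(position, p)]
        distance = math.sqrt(sum(v * v for v in to_light))
        if distance == 0.0:
            continue  # degenerate sample located exactly at p
        cos_theta = sum(n * v for n, v in zip(normal, to_light)) / distance
        if cos_theta <= 0.0:
            continue  # the sample lies below the surface at p
        if is_blocked(p, position):
            continue  # an occluder blocks this shadow ray
        total += radiance * cos_theta * solid_angle  # dE = L * cos(theta) * d_omega
    return total

samples = [((0.0, 10.0, 0.0), 5.0, 0.1), ((10.0, 10.0, 0.0), 5.0, 0.1)]
origin, up = (0.0, 0.0, 0.0), (0.0, 1.0, 0.0)
unshadowed = direct_irradiance(origin, up, samples, lambda p, q: False)
half_shadowed = direct_irradiance(origin, up, samples, lambda p, q: q[0] > 5.0)
```

Each call depends only on its own point and the read-only scene
data, which is why this loop maps naturally onto one thread per
ray.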


IV. Parallel Optimization
A. Parallel Ray Tracing

In the traditional global illumination model, when a ray
intersects an object in the scene, it produces several secondary
rays. Some of the secondary rays are shadow rays, which are used
to check the visibility of the light sources. All the others are
treated as a new generation of rays that propagate again
(intersection test and radiosity calculation), as shown in
Fig. 3.



Fig. 3. Rays transfer radiosity to other objects through reflection and
refraction.


Recursion is used to trace the secondary rays until they reach
the maximum recursion depth. Because secondary rays transfer
radiosity to other objects, global illumination is also called
indirect illumination.


















Since OpenCL kernels do not support recursion, the recursion
needs to be transformed into iteration, and the number of
iterations is used to simulate the recursion depth. When a ray
reaches a point on the surface of an object, the shadow rays
derived at the intersection only need to sample each light source
once; traversing all the sampling points of each light source is
unnecessary. Once all the rays have sampled the surface of each
light source exactly once, the rendering pass ends and the image
is updated. The next pass selects another sampling point and
starts the same work again, and the new color value is then added
onto the pixel. These iterations simulate the integration of the
radiosity over the sampling points on the surface of the light
source.
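
The recursion-to-iteration transformation can be illustrated with
a simplified model. The sketch below assumes that one ray path is
already given as a list of (emitted, reflectance) pairs, one per
surface hit; this data layout and both function names are
hypothetical. The iterative version carries a running
`throughput` (the product of reflectances so far) in place of the
call stack.

```python
def radiance_recursive(bounces, depth=0, max_depth=6):
    """Reference version: L = emitted + reflectance * L(next bounce)."""
    if depth >= max_depth or depth >= len(bounces):
        return 0.0
    emitted, reflectance = bounces[depth]
    return emitted + reflectance * radiance_recursive(bounces, depth + 1, max_depth)

def radiance_iterative(bounces, max_depth=6):
    """Same result with a loop, as needed where recursion is unavailable."""
    radiance = 0.0
    throughput = 1.0  # product of the reflectances encountered so far
    for emitted, reflectance in bounces[:max_depth]:
        radiance += throughput * emitted
        throughput *= reflectance
    return radiance

path = [(0.0, 0.5), (1.0, 0.5), (2.0, 0.0)]  # illustrative values
```

The loop bound plays the role of the maximum recursion depth, so
truncating the path has the same effect in both versions.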

In Fig. 4, on the GPU architecture, each kernel thread traces a
single ray and obtains the final color value of that ray. Each
time all threads have executed the kernel function once, the
intermediate values are added to the pixels. To render an image,
the same kernel function is launched iteratively.
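
The host-side accumulation across repeated kernel launches might
look like the sketch below. It assumes that the displayed pixel
is the running average of all passes so far; the text only says
that values are added to the pixels, so the averaging step and
the class name `ProgressiveImage` are assumptions.

```python
class ProgressiveImage:
    """Accumulates one value per pixel per pass and exposes the running average."""

    def __init__(self, n_pixels):
        self.sums = [0.0] * n_pixels
        self.passes = 0

    def add_pass(self, pass_values):
        """Add the per-pixel results of one kernel launch."""
        for i, value in enumerate(pass_values):
            self.sums[i] += value
        self.passes += 1

    def current(self):
        """Image to display after the passes completed so far."""
        if self.passes == 0:
            return list(self.sums)
        return [s / self.passes for s in self.sums]

image = ProgressiveImage(2)
image.add_pass([0.0, 1.0])  # first pass: pixel 0 happened to sample a blocked light
image.add_pass([1.0, 1.0])  # second pass: a different sampling point
```

A pixel that is black after the first pass moves toward its true
value as further passes are averaged in.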



Fig. 4. Overview of the parallel ray tracing algorithm using the GPU
(axis label: NDRange size X).


B. GPU Kernel Function 

To simplify the programming model, this paper studies only the
rendering of spheres. The implicit equation of a sphere with
center $\mathbf{c}$ and radius $r$ can be written in vector form:

$$(\mathbf{p} - \mathbf{c}) \cdot (\mathbf{p} - \mathbf{c}) - r^2 = 0 \qquad (9)$$

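A line (the ray) with origin $\mathbf{o}$ and direction
$\mathbf{d}$ can be expressed as:

$$\mathbf{p} = \mathbf{o} + t\mathbf{d} \qquad (10)$$

Substituting Eq. (10) into Eq. (9) gives:

$$(\mathbf{d} \cdot \mathbf{d})\,t^2 + 2[(\mathbf{o} - \mathbf{c}) \cdot \mathbf{d}]\,t + (\mathbf{o} - \mathbf{c}) \cdot (\mathbf{o} - \mathbf{c}) - r^2 = 0 \qquad (11)$$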
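
The intermediate step, which the text does not show, follows from
the bilinearity of the dot product. As a sketch, write
$\mathbf{v} = \mathbf{o} - \mathbf{c}$ (a shorthand introduced
here only for illustration):

```latex
\begin{aligned}
(\mathbf{o} + t\mathbf{d} - \mathbf{c}) \cdot (\mathbf{o} + t\mathbf{d} - \mathbf{c}) - r^2
  &= (\mathbf{v} + t\mathbf{d}) \cdot (\mathbf{v} + t\mathbf{d}) - r^2 \\
  &= (\mathbf{d} \cdot \mathbf{d})\,t^2
     + 2(\mathbf{v} \cdot \mathbf{d})\,t
     + \mathbf{v} \cdot \mathbf{v} - r^2
\end{aligned}
```

Setting the last line to zero yields the quadratic in $t$.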

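Eq. (11) is a quadratic equation in the unknown $t$, so it can be
solved with the quadratic formula:

$$t = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \qquad (12)$$

Note that the coefficients $a$, $b$, and $c$ are calculated as
$a = \mathbf{d} \cdot \mathbf{d}$,
$b = 2(\mathbf{o} - \mathbf{c}) \cdot \mathbf{d}$, and
$c = (\mathbf{o} - \mathbf{c}) \cdot (\mathbf{o} - \mathbf{c}) - r^2$,
where the scalar coefficient $c$ is distinct from the sphere
center $\mathbf{c}$.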
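
A direct transcription of Eq. (12) into Python is sketched below.
It is not the paper's OpenCL kernel: the function name, the use
of plain tuples for vectors, and the small epsilon that rejects
hits at or behind the ray origin are all assumptions made for
this example.

```python
import math

def intersect_sphere(origin, direction, center, radius):
    """Return the smallest positive t at which the ray hits the sphere, or None."""
    oc = [o - c for o, c in zip(origin, center)]
    a = sum(d * d for d in direction)
    b = 2.0 * sum(v * d for v, d in zip(oc, direction))
    c = sum(v * v for v in oc) - radius * radius
    discriminant = b * b - 4.0 * a * c
    if discriminant < 0.0:
        return None  # the line misses the sphere
    root = math.sqrt(discriminant)
    # Try the nearer root first, then the farther one.
    for t in ((-b - root) / (2.0 * a), (-b + root) / (2.0 * a)):
        if t > 1e-6:  # assumed epsilon: ignore hits at or behind the origin
            return t
    return None

ray_origin, ray_dir = (0.0, 0.0, 0.0), (0.0, 0.0, -1.0)
t_hit = intersect_sphere(ray_origin, ray_dir, (0.0, 0.0, -5.0), 1.0)
hit_point = tuple(o + t_hit * d for o, d in zip(ray_origin, ray_dir))
```

For this ray and a unit sphere centered five units down the
negative z axis, the nearer root is the front surface of the
sphere; substituting it back into the ray equation gives the
intersection coordinates.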


Eq. (12) determines whether a ray intersects a sphere. If it
does, the coordinates of the intersection can be calculated. To
compute intersections more efficiently, we need to transform the
equation into an OpenCL kernel function. In combination with
Eq. (8), massively parallel integration can then improve the
efficiency of rendering.

V. Results and Discussion 

Tests were conducted on a system with an Intel Core i7-2720QM
CPU running at 2.20 GHz and 4 GB of 1600 MHz DDR3 DRAM. The
platform also had an ATI Radeon HD 6750M GPU. The scene file was
provided by David Bucciarelli, and the scene resolution is
640 x 480.



Fig. 5. The first rendering took only 0.508 seconds to generate the image. 


In Fig. 5, the image was generated using the local illumination
model after a single rendering cycle had finished. Because each
parallel rendering pass selects only one sampling point, parts of
the image contain a considerable amount of black noise. As more
cycles are completed, the sampling points cover most of the
pixels in the scene, so the image shows better rendering effects
(see Fig. 6). As time goes on and more sampling points are
rendered, the image becomes more accurate.

In Fig. 7, the image was generated using the global illumination
model for the same scene. Its recursion depth was 6, and it took
20 seconds to generate. The experimental result shows that
GPU-based parallel ray tracing significantly improves rendering
effects. Fig. 8 presents a comparative evaluation of ray tracing
that processes the same number of sampling points on two
different platforms: multi-core (i7-2720QM CPU) and many-core
(ATI Radeon HD 6750M GPU).

VI. Conclusion 
Fig. 6. After 6 seconds, the image showed better rendering effects.
Fig. 7. Overview of radiosity model.





Fig. 8. Overview of radiosity model. 












